DOMAIN: Semiconductor manufacturing process

CONTEXT:

A modern semiconductor manufacturing process is under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system: the measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have far more signals than are actually required. If we consider each type of signal as a feature, then feature selection can be applied to identify the most relevant signals. Process engineers can then use these signals to determine the key factors contributing to yield excursions downstream in the process, enabling increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals impacting the yield type can be identified.

DATA DESCRIPTION: sensor-data.csv : (1567, 592)

The data consists of 1567 examples, each with 591 features. Each example represents a single production entity with its associated measured features, and the label represents a simple pass/fail yield for in-house line testing. In the target column, "-1" corresponds to a pass and "1" corresponds to a fail, and the timestamp records that specific test point.

PROJECT OBJECTIVE:

We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.

Import and explore the data.

First column is of Python object data type (timestamps).
Last column is of integer data type (the target).
The remaining columns are of float data type.
Checking for null values in the dataset.
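The dtype and null checks above can be sketched as follows, a minimal example assuming pandas; the tiny frame here is an illustrative stand-in for sensor-data.csv, not the real data:

```python
import pandas as pd
import numpy as np

# Hypothetical two-row stand-in for sensor-data.csv (really 1567 x 592).
df = pd.DataFrame({
    "Time": ["2008-07-19 11:55:00", "2008-07-19 12:32:00"],  # object dtype
    "0": [3030.93, np.nan],                                  # float sensor column
    "Pass/Fail": [-1, 1],                                    # integer target
})

print(df.dtypes)  # Time: object, sensor column: float64, target: int64

# Count nulls per column and show only the columns that have any.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])
```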

Data cleansing

Missing value treatment.
There are no null values in the 'Pass/Fail' (target) column.

The cell below checks the percentage of missing values per column, to understand the data better.
Imputing the dataset with 0 as the fill value.
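The missing-value percentage check and constant-fill imputation could look like this, a sketch assuming pandas; the toy frame is illustrative:

```python
import pandas as pd
import numpy as np

# Toy frame with missing sensor readings (stand-in for the real dataset).
df = pd.DataFrame({"s1": [1.0, np.nan, 3.0],
                   "s2": [np.nan, np.nan, 6.0]})

# Percentage of missing values per column, worst first.
missing_pct = df.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Impute with a constant fill value of 0.
df_filled = df.fillna(0)
```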
Let's check the correlation; if two variables are highly correlated, we can drop one of them.
The correlation matrix can be used to identify unwanted columns.
The dataframe above will be used to determine the right number of columns to drop in later steps.
Let's examine the correlations by plotting graphs.
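One common way to drop one member of each highly correlated pair is to scan the upper triangle of the absolute correlation matrix. This is a sketch on synthetic data; the 0.95 threshold and the feature names are illustrative choices, not values from the project:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                      # independent noise
})

corr = df.corr().abs()
# Keep the upper triangle only, so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```

Dropping from the upper triangle keeps the first column of each correlated pair and removes the later duplicates.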

Data analysis & visualisation:

Using the chi-square test we determine the importance of each variable; variables that are not statistically significant can be dropped.
The columns listed above are of statistical importance in the dataset. Since there are many columns, only these need be used for further data analysis.
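Chi-square scoring with scikit-learn can be sketched as below. Note that `chi2` requires non-negative inputs, so the features are min-max scaled first; the synthetic data and `k=2` are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

# chi2 needs non-negative values, so scale each feature to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

selector = SelectKBest(chi2, k=2).fit(X_scaled, y)
print(selector.scores_)                       # per-feature chi-square scores
selected = selector.get_support(indices=True)  # indices of the k best features
```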
After observing the distribution plots, most columns have a normal distribution, while a few tail-end columns (column 419 and later) show either right or left skewness. A few columns exhibit bimodality. Since these data are machine generated, they are generally reliable and can be used directly by the algorithm; there is little chance of a flaw in the data collection.
The boxplots show the distribution of each column, and outliers are observed in many of them. Since the data is machine generated, we choose not to treat the outliers. As this is a binary classification problem, it is better to plot the graphs by class to understand the overlap in the data.
In the box plots above we can see differences in the data distribution between the classes. Although there is some overlap, the data points hint at separation between the classes. We will need to oversample the data, as the fail class has very few instances.
In the target column, -1 corresponds to a pass and 1 to a fail. Here we can clearly observe the per-class behaviour for a particular column; the usual observation is that fail cases have a different mean and median than pass cases.
The dataset is imbalanced, with fewer "Fail" cases.
Most of the columns remaining after dropping the unnecessary ones are important.
Data pre-processing:
• Segregate predictors vs target attributes
• Check for target balancing and fix it if found imbalanced.
• Perform train-test split and standardise the data or vice versa if required.
• Check if the train and test data have similar statistical characteristics when compared with original data.
We will use the entire dataset for preprocessing; later we can use a reduced-column dataset for performance improvements.
Segregate predictors vs target attributes
Check for target balancing and fix it if found imbalanced.
The available dataset is imbalanced: "Fail" (value 1 in the 'Pass/Fail' column) accounts for a small percentage of the data.
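Segregating predictors from the target and checking the class balance can be sketched as follows, assuming pandas; the tiny frame is an illustrative stand-in:

```python
import pandas as pd

# Hypothetical miniature of the cleaned dataset.
df = pd.DataFrame({"s1": [0.1, 0.2, 0.3, 0.4, 0.5],
                   "s2": [1.0, 1.1, 0.9, 1.2, 1.0],
                   "Pass/Fail": [-1, -1, -1, -1, 1]})

X = df.drop(columns=["Pass/Fail"])  # predictors
y = df["Pass/Fail"]                 # target

# Reveals the pass/fail imbalance as class proportions.
print(y.value_counts(normalize=True))
```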

Apply Smote technique to balance dataset

The method below is kept for future use; it addresses the question of implementing one's own sampling technique.

Apply sampling technique to balance dataset
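A simple custom balancing technique is random oversampling of the minority class with replacement. This sketch uses `sklearn.utils.resample`; the toy frame and the random seed are illustrative:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 8 pass (-1) rows, 2 fail (1) rows.
df = pd.DataFrame({"s1": range(10),
                   "Pass/Fail": [-1] * 8 + [1] * 2})

majority = df[df["Pass/Fail"] == -1]
minority = df[df["Pass/Fail"] == 1]

# Sample the minority class with replacement up to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=7)
balanced = pd.concat([majority, minority_up])
```

Unlike SMOTE, which synthesises new points by interpolating between minority neighbours, this simply duplicates existing fail rows, so it is cheap but adds no new information.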

Perform train-test split and standardise the data or vice versa if required.
Using standard scaler to standardise the data 
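The split-then-scale step can be sketched as below, assuming scikit-learn. Fitting the scaler on the training split only avoids leaking test-set statistics into training; the synthetic data and split sizes are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Fit the scaler on the training split only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```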
Check if the train and test data have similar statistical characteristics when compared with original data
The shapes of the data match, so we can compare them as shown below.
No significant difference is found between the split datasets.
Since it is convenient to compare using CSV files, the dataframes above are stored as CSVs and checked for changes. There is no significant difference between the main dataset and the train/test splits.

Model training, testing and tuning

Model training
Pick a supervised learning model and train it.
When we run LogisticRegression on the raw data, i.e. without PCA, standardisation, or attribute removal, it does not produce the right output: recall and precision are very low. 31 entities are actual fails but are labelled as passes, which indicates the model needs correction. There are no true positive cases, which indicates that the algorithm is biased towards the pass class. We can apply target balancing and then compare the results.
Here we observe that accuracy is quite high, but the other metrics show low numbers.
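The "high accuracy, weak minority-class detection" pattern is easy to reproduce on synthetic data with a similar imbalance; this sketch assumes scikit-learn, and the ~93:7 class weighting is an illustrative stand-in for the real pass/fail ratio:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy problem mimicking a ~93:7 pass/fail ratio.
X, y = make_classification(n_samples=1500, n_features=20,
                           weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Accuracy looks strong because the majority class dominates;
# recall on the minority (fail) class tells the real story.
print("accuracy:", accuracy_score(y_te, pred))
print("recall:  ", recall_score(y_te, pred))
```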
Use cross validation techniques.
Here we observe that the model's accuracy is around 90% with a 3% deviation, so performance in production should fall between 87% and 93%. This can be improved by applying performance-improvement techniques such as dimensionality reduction, attribute removal, standardisation/normalisation, and target balancing.
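The mean-and-deviation estimate comes from k-fold cross-validation; a minimal sketch with scikit-learn on synthetic data (the fold count and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

# 5-fold cross-validation; mean +/- std gives the expected production range.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```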
Note : hyper-parameter tuning techniques will be applied once the target balancing is applied on data so we can get best parameters.
Use any other technique/method which can enhance the model performance
Here we observe that accuracy, precision, and recall improved after applying a few additional techniques, resulting in better model performance. We shall apply another target-balancing technique to further improve performance and reduce Type I error. True positive cases are now observed, which is a good sign.
After applying a custom method to change the population, the model's performance has improved. Here we used a statistical method to upsample the data, and recall and precision have improved. The number of devices that are actually defective but predicted as passes has come down from the previous iteration.
Apply hyper-parameter tuning techniques to get the best accuracy.
Suggestion: Use all possible hyper parameter combinations to extract the best accuracies
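An exhaustive search over hyper-parameter combinations is what `GridSearchCV` does; this sketch assumes scikit-learn, and the grid values are illustrative, not the grid used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination in the grid is evaluated with 3-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```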
Although we observe high accuracy and recall scores, it is better to look at the confusion matrix.
True Positives (TP): 104 fails correctly predicted as fail.
True Negatives (TN): 393 passes correctly predicted as pass.
False Positives (FP): 11 passes falsely predicted as fail (Type I error).
False Negatives (FN): 47 fails falsely predicted as pass (Type II error).
The 47 false negatives, where the model predicted a pass when the entity was actually a fail, are the costliest errors for yield screening.
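Extracting TP/TN/FP/FN with the pass/fail label convention can be sketched as below, assuming scikit-learn; the tiny label vectors are illustrative, not the project's predictions:

```python
from sklearn.metrics import confusion_matrix

# Label convention: 1 = fail (positive class), -1 = pass (negative class).
y_true = [1, 1, -1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1, 1]

# Fixing labels=[-1, 1] pins the matrix layout so ravel() unpacks
# reliably as (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
print(tn, fp, fn, tp)
```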
Apply the above steps for all possible models that you have learnt so far.
Display and compare all the models designed with their train and test accuracies.
For the original dataset, which is not oversampled:
Reason for choosing model

As observed in all three result tables, there are various models to compare. Among them, Logistic Regression and SVC have good overall scores, while XGB and MLP show 100% accuracy on the train data, which suggests some overfitting; looking at the confusion matrix, however, their misclassification rates are very low. The confusion matrix indicates that we could use MLP as the final algorithm of choice for the given data. If more instances were collected, we could try MLP, SVC, and XGB and conclude on the best algorithm. For now, based on the recall and precision scores and the confusion matrix, we choose MLP for building the model.
Select the final best trained model along with your detailed comments for selecting this model
Pickle the selected model for future use
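Persisting and restoring the chosen model can be sketched with the standard-library `pickle` module; the filename and the stand-in model are illustrative assumptions:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)  # stand-in for the final model

# Serialise the fitted model to disk (hypothetical filename).
with open("final_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (e.g. when the future data file arrives), restore and predict.
with open("final_model.pkl", "rb") as f:
    restored = pickle.load(f)
```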
Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results
Conclusion and improvement:
Write your conclusion on the results
The results show good prediction: the pass cases and three fail cases are predicted correctly, though one wrong prediction is observed. The confusion matrix gives a good idea of how classification is done on the train data, and the other metrics show the performance on the test dataset.
We could plan to collect more data and also brief the team on feature importance, so that they can spend more time on correct data collection and thus help improve the model's performance.